11 research outputs found

    Spanish sentiment analysis in Twitter at the TASS workshop

    [EN] This paper describes a support vector machine-based approach to several sentiment analysis tasks on Spanish-language Twitter. We focus on parameter optimization of the models and on combining several models by means of voting techniques. We evaluate the proposed approach on all the tasks defined in the five editions of the TASS workshop, held between 2012 and 2016. TASS has become a framework for sentiment analysis tasks focused on the Spanish language. We describe our participation in this competition and the results achieved, and then analyze and compare them with the best approaches of the teams that participated in the tasks defined in the TASS workshops. To our knowledge, our results exceed those published to date in the sentiment analysis tasks of the TASS workshops.
    This work has been partially funded by the Spanish MINECO and FEDER funds under project ASLP-MULAN: Audio, Speech and Language Processing for Multimedia Analytics, TIN2014-54288-C4-3-R.
    Pla Santamaría, F.; Hurtado Oliver, LF. (2018). Spanish sentiment analysis in Twitter at the TASS workshop. Language Resources and Evaluation. 52(2):645-672. https://doi.org/10.1007/s10579-017-9394-7
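The abstract for this entry mentions combining several models by means of voting. The listing does not say which voting scheme the authors used, so the following is only a minimal sketch of plain majority voting over per-model polarity labels. The function name, the label strings, and the tie-breaking rule (first label in model order wins) are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per tweet.

    predictions[m][i] is the label that model m assigns to tweet i.
    Tie-breaking rule (an assumption, not taken from the paper): among the
    labels with the highest count, the one seen first in model order wins.
    """
    if not predictions:
        return []
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all models must label the same number of tweets")

    combined = []
    for i in range(n_items):
        votes = [p[i] for p in predictions]
        counts = Counter(votes)
        best = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == best))
    return combined


# Hypothetical outputs of three classifiers on three tweets.
model_outputs = [
    ["P", "N", "NEU"],
    ["P", "P", "N"],
    ["N", "P", "NEU"],
]
print(majority_vote(model_outputs))
```

With real classifiers, each inner list would hold the predictions of one trained model on the same test tweets; weighted voting is a common alternative when the models differ in reliability.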

    Language identification of multilingual posts from Twitter: a case study

    The final publication is available at Springer via http://dx.doi.org/10.1007/s10115-016-0997-x
    This paper describes a method for handling multi-class and multi-label classification problems based on the support vector machine formalism. This method has been applied to the language identification problem in Twitter. The system evaluation was performed mainly on a Twitter data set developed in the TweetLID workshop. This data set contains bilingual tweets written in the most commonly used Iberian languages (i.e., Spanish, Portuguese, Catalan, Basque, and Galician) as well as the English language. We address the following problems: (1) social media texts: we propose a suitable tokenization that handles the peculiarities of Twitter; (2) multilingual tweets: since a tweet can belong to more than one language, we need a multi-class and multi-label classifier; (3) similar languages: we study the main confusions among similar languages; and (4) unbalanced classes: we propose a threshold-based strategy to favor classes with less data. We have also studied the use of Wikipedia and the addition of new tweets in order to enlarge the training data set. Additionally, we have tested our system on the Bergsma corpus, a collection of tweets in nine languages, focusing on confusable languages that use the Cyrillic, Arabic, and Devanagari alphabets. To our knowledge, we obtained the best results published on the TweetLID data set and results that are in line with the best results published on the Bergsma data set.
    This work has been partially funded by the project ASLP-MULAN: Audio, Speech and Language Processing for Multimedia Analytics (MINECO TIN2014-54288-C4-3-R).
    Pla Santamaría, F.; Hurtado Oliver, LF. (2016). Language identification of multilingual posts from Twitter: a case study. Knowledge and Information Systems. 51(3):965-989. https://doi.org/10.1007/s10115-016-0997-x
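The abstract for this entry describes a multi-label classifier with a threshold-based strategy that favors classes with less data. The exact procedure is not given in this listing, so the sketch below only illustrates the general idea under stated assumptions: each language has a decision score for a tweet, under-represented languages get a lower (more permissive) threshold, and a tweet with no label above threshold falls back to its top-scoring language. All names, scores, and threshold values are hypothetical.

```python
def assign_languages(scores, thresholds, default_threshold=0.0):
    """Return the sorted list of languages assigned to one tweet.

    scores: dict mapping language code to a classifier decision score.
    thresholds: dict mapping language code to its per-class threshold;
        languages missing from it use default_threshold.
    A lower threshold makes a language easier to assign, which is one way
    to favor classes with little training data (an assumed strategy).
    """
    chosen = [
        lang
        for lang, score in scores.items()
        if score >= thresholds.get(lang, default_threshold)
    ]
    if not chosen:
        # Fallback (an assumption): every tweet gets at least one language.
        chosen = [max(scores, key=scores.get)]
    return sorted(chosen)


# Hypothetical decision scores for one tweet.
tweet_scores = {"es": 0.4, "ca": -0.2, "gl": -0.35, "eu": -1.0}
# Hypothetical lowered thresholds for the smaller classes.
per_class_thresholds = {"gl": -0.5, "eu": -0.5}
print(assign_languages(tweet_scores, per_class_thresholds))
```

In practice the thresholds would be tuned on held-out data rather than set by hand.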

    Transformer based contextualization of pre-trained word embeddings for irony detection in Twitter

    [EN] Human communication using natural language, especially in social media, is influenced by the use of figurative language such as irony. Recently, several workshops have explored the task of irony detection in Twitter using computational approaches. This paper describes a model for irony detection based on the contextualization of pre-trained Twitter word embeddings by means of the Transformer architecture. This approach is based on the same powerful architecture as BERT but, unlike BERT, it allows us to use in-domain embeddings. We performed an extensive evaluation on two corpora, one for the English language and another for the Spanish language. Our system was the top-ranked system on the Spanish corpus and, to our knowledge, it has achieved the second-best result on the English corpus. These results support the correctness and adequacy of our proposal. We also studied and interpreted how the multi-head self-attention mechanisms specialize in detecting irony by considering the polarity and relevance of individual words and even the relationships among words. This analysis is a first step towards understanding how the multi-head self-attention mechanisms of the Transformer architecture address the irony detection problem.
    This work has been partially supported by the Spanish Ministerio de Ciencia, Innovacion y Universidades and FEDER funds under project AMIC (TIN2017-85854-C4-2-R) and the GiSPRO project (PROMETEU/2018/176). Work of Jose-Angel Gonzalez is financed by Universitat Politecnica de Valencia under grant PAID-01-17.
    González-Barba, JÁ.; Hurtado Oliver, LF.; Pla Santamaría, F. (2020). Transformer based contextualization of pre-trained word embeddings for irony detection in Twitter. Information Processing & Management. 57(4):1-15. https://doi.org/10.1016/j.ipm.2020.102262

    Choosing the right loss function for multi-label Emotion Classification

    [EN] Natural Language Processing problems have recently benefited from advances in Deep Learning. Many of these problems can be addressed as multi-label classification problems. Usually, the metrics used to evaluate classification models differ from the loss functions used in the learning process. In this paper, we present a strategy for incorporating evaluation metrics into the learning process in order to increase the performance of the classifier according to the measure we want to favor. Concretely, we propose soft versions of the Accuracy, micro-F-1, and macro-F-1 measures that can be used as loss functions in the back-propagation algorithm. To validate our approach experimentally, we tested our system on an Emotion Classification task proposed at the International Workshop on Semantic Evaluation, SemEval-2018. Using a Convolutional Neural Network trained with the proposed loss functions, we obtained significant improvements for both the English and the Spanish corpora.
    This work has been partially supported by the Spanish MINECO and FEDER funds under project AMIC (TIN2017-85854-C4-2-R) and the GiSPRO project (PROMETEU/2018/176). Work of Jose-Angel Gonzalez is also financed by Universitat Politecnica de Valencia under grant PAID-01-17.
    Hurtado Oliver, LF.; González-Barba, JÁ.; Pla Santamaría, F. (2019). Choosing the right loss function for multi-label Emotion Classification. Journal of Intelligent & Fuzzy Systems. 36(5):4697-4708. https://doi.org/10.3233/JIFS-179019
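The idea of a "soft" metric used as a loss can be illustrated with a minimal sketch. The abstract does not give the authors' exact formulation, so the code below is only one common way to build a differentiable macro-F-1 surrogate: the hard 0/1 predictions in the true-positive, false-positive and false-negative counts are replaced by predicted probabilities. The function name `soft_macro_f1_loss` and the `eps` smoothing term are assumptions made for illustration.

```python
def soft_macro_f1_loss(y_true, y_prob, eps=1e-8):
    """Return 1 - soft macro-F1 for multi-label data.

    y_true: list of rows of 0/1 gold labels (one column per label).
    y_prob: list of rows of predicted probabilities in [0, 1].
    eps:    small constant (an assumption) that avoids division by zero.
    """
    n_labels = len(y_true[0])
    f1_sum = 0.0
    for j in range(n_labels):
        # Soft counts: probabilities stand in for hard 0/1 decisions,
        # so each count varies smoothly with the model output.
        tp = sum(t[j] * p[j] for t, p in zip(y_true, y_prob))
        fp = sum((1 - t[j]) * p[j] for t, p in zip(y_true, y_prob))
        fn = sum(t[j] * (1 - p[j]) for t, p in zip(y_true, y_prob))
        f1_sum += 2 * tp / (2 * tp + fp + fn + eps)
    # Macro averaging: every label contributes equally to the mean.
    return 1.0 - f1_sum / n_labels


gold = [[1, 0], [0, 1]]
perfect = soft_macro_f1_loss(gold, [[1.0, 0.0], [0.0, 1.0]])
uncertain = soft_macro_f1_loss(gold, [[0.5, 0.5], [0.5, 0.5]])
print(perfect, uncertain)
```

In this toy example the loss is close to 0 for perfect probabilities and about 0.5 when every probability is 0.5. In a real training setup the same expression would be written with an automatic-differentiation framework so that gradients can flow through the soft counts; a micro-averaged variant would sum the counts over all labels before computing a single F-1.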

    Self-attention for Twitter sentiment analysis in Spanish

    [EN] This paper describes our proposal for Sentiment Analysis in Twitter for the Spanish language. The main characteristics of the system are the use of word embeddings trained specifically on Spanish tweets and the use of self-attention mechanisms, which make it possible to model sequences without convolutional or recurrent layers. These self-attention mechanisms are based on the encoders of the Transformer model. The results obtained on Task 1 of the TASS 2019 workshop, for all the Spanish variants proposed, support the correctness and adequacy of our proposal.
    This work has been partially supported by the Spanish MINECO and FEDER funds under project AMIC (TIN2017-85854-C4-2-R) and the GiSPRO project (PROMETEU/2018/176). Work of Jose-Angel Gonzalez is financed by Universitat Politecnica de Valencia under grant PAID-01-17.
    González-Barba, JÁ.; Hurtado Oliver, LF.; Pla Santamaría, F. (2020). Self-attention for Twitter sentiment analysis in Spanish. Journal of Intelligent & Fuzzy Systems. 39(2):2165-2175. https://doi.org/10.3233/JIFS-179881
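To show how self-attention relates every token to every other token without recurrence or convolution, here is a minimal sketch of single-head scaled dot-product self-attention. It is not the system described in the abstract: as a simplifying assumption, the query, key and value projections are the identity (a Transformer encoder learns separate projection matrices and uses several heads), and the names `softmax` and `self_attention` are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Single-head scaled dot-product self-attention.

    X is a list of token embeddings (lists of floats of equal length).
    Assumption: queries, keys and values are the embeddings themselves.
    Returns the contextualized vectors and the attention weights.
    """
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all token vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, X)) for i in range(d)])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, attn = self_attention(tokens)
print(attn[0])
```

Every row of the attention weights sums to 1, and each output vector mixes information from the whole sequence in a single step, which is what lets such a model dispense with recurrent or convolutional layers.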

    ELiRF-UPV at TASS 2020: TWilBERT for Sentiment Analysis and Emotion Detection in Spanish Tweets

    [EN] This paper describes the participation of the ELiRF research group of the Universitat Politècnica de València in the TASS 2020 Workshop, held within the XXXVI edition of the International Conference of the Spanish Society for the Processing of Natural Language (SEPLN). We present the approach used for the Monolingual Sentiment Analysis and Emotion Detection tasks of the workshop, as well as the results obtained. Our participation focused mainly on an adaptation of BERT for text classification in the Twitter domain and the Spanish language. This system, which we call TWilBERT, has shown systematic improvements over the state of the art in almost all the tasks held at the SEPLN conference in previous years, and it also obtains the most competitive performance in the tasks addressed in this work.
    This work has been partially supported by the Spanish MINECO and FEDER funds under project AMIC (TIN2017-85854-C4-2-R) and by the GiSPRO project (PROMETEU/2018/176). Work of José-Ángel González is financed by Universitat Politècnica de València under grant PAID-01-17.
    González-Barba, JÁ.; Arias-Moncho, J.; Hurtado Oliver, LF.; Pla Santamaría, F. (2020). ELiRF-UPV at TASS 2020: TWilBERT for Sentiment Analysis and Emotion Detection in Spanish Tweets. CEUR. 179-186. http://hdl.handle.net/10251/17855817918

    Aspect-based sentiment analysis using ontologies and machine learning

    [EN] In this paper, we present an aspect-based sentiment analysis system that automatically extracts the characteristics of an opinion and determines their associated polarity. The proposed system is based on a model that uses domain ontologies to detect aspects and a Support Vector Machine classifier to assign a polarity to the detected aspects. The experimental work was conducted using the dataset developed for Task 5, Sentence-level ABSA, in SemEval 2016 for Spanish. Using an ontology-based approach, the proposed system obtained an F1 of 73.07 in the aspect extraction subtask (slot2) and an F1 of 46.24 in the joint categorization and aspect extraction subtask (slot1,2). For the sentiment classification subtask (slot3), an Accuracy of 84.79% was obtained using an approach based on Support Vector Machines and polarity lexicons. These results are better than those reported in SemEval.
    This work has been partially funded by the project ASLP-MULAN: Audio, Speech and Language Processing for Multimedia Analytics (MINECO TIN2014-54288-C4-3-R and FEDER funds). The stay of Carlos Henríquez at the UPV, from January to March 2017, was funded by the Colciencias program (call 727), Universidad Nacional de Medellín and Universidad Autónoma del Caribe, Barranquilla.
    Carlos Henríquez; Pla Santamaría, F.; Hurtado Oliver, LF.; Jaime Guzmán (2017). Análisis de sentimientos a nivel de aspecto usando ontologías y aprendizaje automático. PROCESAMIENTO DEL LENGUAJE NATURAL. (59):49-56. http://hdl.handle.net/10251/101450

    ELIRF at MEDIAEVAL 2013: Similar Segments of Social Speech Task

    This paper describes the approaches and results of the Natural Language Engineering and Pattern Recognition group (ELiRF) in the Similar Segments of Social Speech Task of MediaEval 2013. The task involves finding segments similar to a query segment in a multimedia collection of informal, unstructured dialogs among members of a small community. Our approach has two phases. In the first phase, the sentences are preprocessed based on the morphology and semantics of the words. In the second phase, a search based on different distance measures is carried out. This was done using both the correctly transcribed sentences and the output of an Automatic Speech Recognizer.
    Work funded by the Spanish Government and the E.U. under the contracts TIN2011-28169-C05 and TIN2012-38603-C02, and FPU Grant AP2010-4193.
    García Granada, F.; Sanchís Arnal, E.; Calvo Lance, M.; Pla Santamaría, F.; Hurtado Oliver, LF. (2013). ELIRF at MEDIAEVAL 2013: Similar Segments of Social Speech Task. CEUR Workshop Proceedings. 1043:135-136. http://hdl.handle.net/10251/38151

    Etiquetado léxico y análisis sintáctico superficial basado en modelos estadísticos

    The general goal of any Natural Language Processing (NLP) system is to obtain some representation of the message conveyed by sentences. Automatic processing of a language is a highly complex problem that involves diverse and complex sources of knowledge: phonetics, morphology, syntax, semantics, pragmatics, world knowledge, etc. Although in some cases these sources of information can be considered independent, in general they are interrelated, and without that interrelation the meaning and function of the words in a sentence cannot be interpreted correctly. Because of this complexity, the problem of understanding a language is usually approached in one of the following ways: 1) Certain simpler subproblems are solved, which in some cases require simplifications so that they can be handled automatically, such as morphological analysis, part-of-speech tagging of texts, shallow parsing of sentences, prepositional phrase attachment, word sense disambiguation, and the treatment of specific linguistic phenomena such as anaphora and ellipsis. 2) The language is simplified by considering tasks restricted in vocabulary size, in the complexity of the syntactic structures used, or in the semantic domain of the application. Many examples from recent years follow one of these two paths. In speech recognition, there are applications restricted to bounded vocabularies, queries to specific databases, dialogue systems for specific tasks, etc. In other fields more directly related to NLP, there are applications in machine translation, information extraction and retrieval, text summarization, etc., which are restricted, to a greater or lesser extent, to specific domains in order to achieve acceptable results.
    On the other hand, the availability of large corpora, textual or spoken, annotated with linguistic information of different kinds (morphosyntactic information, full or partial syntactic analysis, semantic information), together with [...], has led to the emergence and use of inductive approaches, or corpus-based methods, within Computational Linguistics; applied to different NLP tasks, they achieve a high level of performance. Inductive approaches, with or without statistical information, are of great interest for Natural Language (NL) disambiguation because, besides providing acceptable results, they use relatively simple models whose parameters can be estimated from data. This makes them especially attractive, since moving from one task to another, or even from one language to another, requires substantially less human intervention. Nevertheless, some cases of ambiguity cannot be resolved in this way, and a human expert must be brought in to introduce, for example, certain rules or constraints that help resolve them.
    Pla Santamaría, F. (2000). Etiquetado léxico y análisis sintáctico superficial basado en modelos estadísticos [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/5646

    TWilBert: Pre-trained deep bidirectional transformers for Spanish Twitter

    [EN] In recent years, the Natural Language Processing community has been moving from uncontextualized word embeddings towards contextualized word embeddings. Among the contextualized architectures, BERT stands out for its capacity to compute bidirectional contextualized word representations. However, the competitive performance it achieves on English downstream tasks is not matched by its multilingual version when applied to other languages and domains. This is especially true for the Spanish used in Twitter. In this work, we propose TWilBERT, a specialization of the BERT architecture for both the Spanish language and the Twitter domain. Furthermore, we propose a Reply Order Prediction signal to learn inter-sentence coherence in Twitter conversations, which improves the performance of TWilBERT in text classification tasks that require reasoning over sequences of tweets. We perform an extensive evaluation of TWilBERT models on 14 different text classification tasks, such as irony detection, sentiment analysis, and emotion detection. The results obtained by TWilBERT outperform the state-of-the-art systems and Multilingual BERT. In addition, we carry out a thorough analysis of the TWilBERT models to study the reasons for their competitive behavior. We release the pre-trained TWilBERT models used in this paper, along with a framework for training, evaluating, and fine-tuning TWilBERT models.
    This work has been partially supported by the Spanish Ministerio de Ciencia, Innovacion y Universidades and FEDER funds under project AMIC (TIN2017-85854-C4-2-R), and the Generalitat Valenciana under GiSPRO (PROMETEU/2018/176) and GUAITA (INNVA1/2020/61) projects. Work of Jose Angel Gonzalez is financed by Universitat Politecnica de Valencia under grant PAID-01-17.
    González-Barba, JÁ.; Hurtado Oliver, LF.; Pla Santamaría, F. (2021). TWilBert: Pre-trained deep bidirectional transformers for Spanish Twitter. Neurocomputing. 426:58-69. https://doi.org/10.1016/j.neucom.2020.09.078
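The abstract names a Reply Order Prediction signal but does not spell out how training examples are built. As a hedged illustration only, the sketch below assumes a binary formulation in which the model sees a pair of tweets and must decide whether they appear in their original tweet-then-reply order or have been swapped. The function name `build_rop_examples`, the `swap_prob` parameter and the label convention (1 for original order, 0 for swapped) are all hypothetical.

```python
import random


def build_rop_examples(pairs, swap_prob=0.5, seed=0):
    """Build (first, second, label) examples from (tweet, reply) pairs.

    Assumed convention: label 1 means the pair is in its original order,
    label 0 means the two texts were swapped.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    examples = []
    for tweet, reply in pairs:
        if rng.random() < swap_prob:
            examples.append((reply, tweet, 0))  # swapped order
        else:
            examples.append((tweet, reply, 1))  # original order
    return examples


conversation = [
    ("¿Alguien vio el partido?", "Sí, increíble el final"),
    ("Hoy llueve otra vez", "Típico de noviembre"),
]
for first, second, label in build_rop_examples(conversation):
    print(label, "|", first, "->", second)
```

A classifier head trained on such pairs has to pick up on which text reads as a response to the other, which is one plausible way to encourage a model to capture coherence between consecutive tweets.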